Check 2 – No high or medium service has recently restarted
Process
• Command: livesp-stability
• API Endpoint: /api/v1/service/stability
• Status: “OK” if all services have been running for at least 2 days, otherwise “KO”.
• Severity: “NONE” if all services have been running for at least 2 days; otherwise the highest criticality among the services that have been running for less than 2 days.
• Messages: one entry for each service that has been running for less than 2 days, including its criticality (see Severity above).
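The Status and Severity rules above can be sketched in shell as follows. This is only an illustration, not the actual livesp-stability implementation: the function name stability_summary, the input format (one "service criticality hours_running" line per service) and the LOW < MEDIUM < HIGH ordering are assumptions made for the example.

```shell
#!/bin/sh
# Hypothetical helper (not part of LiveSP): reads lines of the form
#   <service> <criticality> <hours_running>
# on stdin and prints "<status> <severity>" following the rule described above.
stability_summary() {
    awk -v threshold_hours=48 '
        BEGIN {
            # Assumed ordering of criticality levels.
            rank["NONE"] = 0; rank["LOW"] = 1; rank["MEDIUM"] = 2; rank["HIGH"] = 3
            worst = "NONE"
        }
        # A service counts as recently restarted if it has been running
        # for less than the threshold (2 days = 48 hours).
        $3 < threshold_hours {
            restarted = 1
            if (rank[$2] > rank[worst]) worst = $2
        }
        END {
            status = restarted ? "KO" : "OK"
            print status, worst
        }
    '
}
```

For instance, feeding it "livesp_bach HIGH 29" yields "KO HIGH", which matches the status and severity shown in the Example section.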
Support action
1. Check whether a manual operation occurred in the last 2 days. If the service was manually killed, a LiveSP installation or upgrade took place, or the server was rebooted, for example, then the restart is expected.
2. What was the system message when it failed?
• Run command dksps followed by the service name, for example dksps livesp_bach
3. Was there a docker heartbeat issue (network communication issue with the manager) on the node of this service? (requires sudo rights)
• Go to the hosting server with ssh $(getSwarmContainerNodeIP )
• Run command sudo journalctl -u docker --since "2 days ago" | grep 'heartbeat'
4. Check the application logs of the service from before it restarted
• Look for logs in /data/logs or in the ELK user interface
Example
$ livesp-stability
{
  "name": "/api/v1/service/stability",
  "timestamp": "2020-08-05T14:15:54Z",
  "status": "KO",
  "severity": "HIGH",
  "messages": [
    "KO - livesp_bach - TaskName: livesp_bach.1 - State: Running 29 hours ago - Criticity: HIGH"
  ]
}
$ dksps livesp_bach
ID                          NAME               NODE     CURRENT STATE           ERROR
q2phpr7ng5fg167435zor6uii   livesp_bach.1      bonite   Running 29 hours ago
q5txlvhq98fggde0uejlwog8d   \_ livesp_bach.1   bonite   Shutdown 29 hours ago
tbvc3mjcul5rmy7h89e45pthi   \_ livesp_bach.1   bonite   Shutdown 2 days ago
z3nu29btw3fwcl2bnsclauv78   \_ livesp_bach.1   bonite   Shutdown 2 days ago
tr8z8whv8ht6mcoa5q4z80zws   \_ livesp_bach.1   bonite   Failed 2 days ago       "task: non-zero exit (137)"
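When reading the ERROR column above, it helps to know the usual convention that an exit code greater than 128 means the process was terminated by a signal, with the signal number equal to the code minus 128. So 137 = 128 + 9, that is SIGKILL, which is often (though not always) the sign of an out-of-memory kill or a forced stop. The helper below is a small sketch of that arithmetic; describe_exit_code is a hypothetical name, not an existing LiveSP command.

```shell
#!/bin/sh
# Hypothetical helper: interpret an exit code using the convention that
# codes above 128 mean "terminated by signal (code - 128)".
describe_exit_code() {
    code=$1
    if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
    elif [ "$code" -eq 0 ]; then
        echo "clean exit"
    else
        echo "application error $code"
    fi
}

describe_exit_code 137   # prints "killed by signal 9"
```

If the signal turns out to be 9, the kernel log and the application logs from step 4 are reasonable places to look for the cause.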